Abstract


Previous work on compiler-assisted multiple instruction retry has utilized a series of compiler
transformations (loop protection, node splitting, and loop expansion) to eliminate anti-dependencies
of length ≤ N in the pseudo register, machine register, and post-pass resolver phases of
compilation. The results have provided a means of rapidly recovering from transient processor failures
by rolling back N instructions. This paper presents techniques for improving compilation and
run-time performance in compiler-assisted multiple instruction retry. Incremental updating enhances
compilation time when new instructions are added to the program. Post-pass code rescheduling
and spill register reassignment algorithms improve the run-time performance and decrease the code
growth across the application programs studied. Branch hazards are also shown to be resolvable
by simple modifications to the incremental updating schemes during the pseudo register phase and
to the spill register reassignment algorithm during the post-pass phase.
1  Introduction


Software based checkpointing provides for rollback recovery when transient system faults occur. In
such schemes, a checkpoint of the system state is captured and recorded at regular intervals [2, 3, 4],
or at predetermined positions in the application program [5]. In the event of a fault, the system can
be rolled back to one of the previously recorded checkpoints, returning the system to a consistent
state [6]. Software checkpointing can accommodate long error detection latencies at the cost of
potentially long recovery time.

In contrast to full software checkpointing, multiple instruction retry schemes aid in rollback of
just a few instructions, requiring shorter error detection latencies while resulting in less lost work
during recovery. Instruction retry schemes have traditionally been implemented in hardware, both
in full checkpointing [7, 8] and in incremental checkpointing (sliding window) [9, 10] formats.

Recently, a compiler-assisted multiple instruction retry scheme has been developed in which
compiler-driven data flow manipulation is used to resolve data hazards associated with rollback
recovery [1]. Anti-dependencies of length ≤ N are eliminated using a series of compiler
transformations. A combined compiler-hardware scheme [11] has also been developed which can remove
one type of hazard while allowing the compiler-driven transformations to resolve the remaining
hazards.

This paper provides compilation and run-time performance enhancements that have been
implemented for compiler-assisted multiple instruction retry. The techniques described include
incremental updating, post-pass code rescheduling, spill register reassignment, and branch hazard
resolution. Implementation and performance benefits of the schemes are evaluated on a set of
twelve programs which are cross-compiled on a SPARCserver 490 and executed on a DECstation
3100.

2  Error  Model  and  Hazard  Types 

Targeted processor errors are described as follows [11]. Error detection latency is ≤ N instructions.
Units external to the CPU, such as memory and I/O, have their own rollback capability (e.g.,
delayed write buffers of depth N and appropriate bypass logic). The program counter contents
at each instruction are preserved by an external recording device or by shadow registers [9]. A
restartable CPU state can be restored by loading the correct contents of the register file and the
program counter.

Figure 1: On-path and Branch Hazards.

Given the above assumptions, a permissible error is one which does not result in a path
inconsistent with the control flow graph (CFG) of the target application program, provided that the
register file contents do not spontaneously change and data is not written to an incorrect register
location. Errors targeted for recovery via multiple instruction retry are summarized as follows: 1) CPU
errors such as those caused by a faulty ALU; 2) incorrect values read from memory, the register
file, or external functional units such as the floating point unit; 3) correct/incorrect operands read
from incorrect locations within the I/O, memory, or register file; and 4) incorrect branch decisions
resulting from errors 1 through 3.

The code can be represented as a CFG, G(V, E), where V is the set of nodes denoting
instructions and E the set of edges denoting flow information. If there is a direct control flow from
instruction I_i to I_j, where I_i ∈ V and I_j ∈ V, then there is an edge (I_i, I_j) ∈ E.

Within the general error model above, data hazards resulting from instruction retry are of two
types [11]. On-path hazards are those encountered when the instruction path after rollback is the
same as the initial instruction path, and branch hazards are those encountered when the instruction
path after rollback is different from the initial instruction path. On-path hazards can also be
described as anti-dependencies of length ≤ N in G(V, E) [12]. As shown in Figure 1, register x
of instruction I_i represents an on-path hazard and register y of instruction I_j represents a branch
hazard.

Table 1: Schemes implemented

Scheme      Pseudo register         Machine register    Nop insertion
--------    --------------------    ----------------    ---------------------
Scheme L    on-path                 on-path             on-path
Scheme A    on-path + branch [*]    on-path + branch    on-path + branch
Scheme 0    on-path [i]             on-path             on-path [cr]
Scheme 1    on-path [i]             on-path             on-path + branch [cr]
Scheme 2    on-path + branch [i]    on-path             on-path + branch [cr]
Scheme 3    on-path + branch [i]    on-path + branch    on-path + branch [cr]

3  Overview  of  Schemes  Implemented 

In order to compare compile time and run time efficiency, we have implemented several schemes
for each of the phases, as shown in Table 1. Data hazards are resolved at three different phases.
The pseudo register phase employs loop protection, node splitting, and loop expansion to resolve
hazards at the pseudo register level. The machine register phase performs register allocation to
resolve machine register hazards, and the nop insertion phase resolves the remaining hazards by
inserting any required nops at the assembly code level.

Scheme L [1] resolves on-path hazards only, and Scheme A [11] resolves both on-path and branch
hazards at all three phases. Scheme A does not resolve all pseudo register branch hazards due to
loop expansion, as marked "[*]". The dominant fraction of compile time in previous Schemes L
and A is devoted to resolving pseudo register hazards. Both schemes implement a simple pseudo
register phase. Loop protection, node splitting, and loop expansion may insert new instructions,
which can change the loop structure and the dataflow information, and may therefore create new
hazards. Since the data structures are not incrementally maintained, both previous schemes repeat
each stage for all loops until there are no new instructions to insert.

In addition to previous Schemes L and A, we implemented four alternative schemes that exploit
incremental compilation techniques. Scheme 0 uses incremental updating in the pseudo register
phase for resolving on-path hazards, which improves compilation time with respect to Scheme
L. Scheme 0 also employs post-pass code rescheduling and spill register reassignment algorithms to
enhance the run-time performance and decrease the code growth across the application programs
studied. The marker "[i]" denotes incremental updating, while "[cr]" denotes code rescheduling.
Modifications to the post-pass algorithms can resolve both types of hazards during the nop insertion
phase (Schemes 1, 2, and 3). We also show that a slightly modified incremental updating scheme can
resolve branch hazards as well in the pseudo register phase (Schemes 2 and 3), though experimental
results favor Scheme 1 in code run time, code growth, and compilation speed.

4  Review  of  the  Pseudo  Register  Phase  in  Scheme  L 

The following notation is for on-path hazards, while that for branch hazards can be similarly
defined. An instruction I_i is a hazard instruction if I_i defines a register x, another instruction I_j
uses x, and there is a directed path of length less than or equal to N from I_j to I_i. Register x
is called a hazard register, or a hazard that causes data inconsistency. An instruction I_j is
split due to hazard register x if x ∈ live_in(I_j) and there is more than one definition of x that
can reach I_j. Loop expansion, combined with renaming, is used to increase the anti-dependency
distance to exceed N within loops. To prevent some loop headers from being split, and to allow
the targeted hazards to be renamed freely after loop expansion, save and restore nodes are inserted
around loop headers, tails, and trailer nodes. A loop can be protected either from outside or from
inside. The former executes save and restore instructions exactly once, while the latter executes
save and restore instructions for every loop iteration. A loop protected from outside therefore
executes fewer instructions whenever the loop is executed at least twice. The saved registers
within the loop are renamed to corresponding new registers. The following conditions are used to
determine if a loop L should be protected for register r:

C1. r is a hazard register which is live after the extended loop of L for register r.

C2. L's header will be split due to its hazard register r.

C3. L's header will be split due to out-of-loop hazard register r.

The extended loop of L for register r consists of all nodes in L and all nodes I_i satisfying the
following rules: 1) r ∈ live_in(I_i), 2) I_i has only one successor, 3) I_i has only one predecessor
I_j, and 4) I_j is in the extended loop. If C1 or C2 is true, L is protected from inside. If C3 is true,
L is protected from outside. L is not protected for r if none of the three conditions hold. C3 prevents
L's header from being split, while C1 and C2 confine r's live range to within each iteration of L,
so that after loop expansion, r can be renamed correctly within each new loop copy. C1 is stated
for the extended loop instead of L, since L may not have to be protected if all nodes in which r is
live after L lie in the extended loop but outside L.

To limit the compilation time and code growth, a threshold is set to 800 instructions so
that the procedure aborts normal evaluation of the loops, and simply inserts enough nops to
resolve the remaining hazards, should the code size exceed the threshold. Since we will
illustrate our ideas using real program segments, we list the benchmarks with code sizes and
descriptions as follows: QUEEN(148), 8-queen program; QSORT(261), recursive quick sort algorithm;
PUZZLE(877), a game; WC(181), CMP(251), GREP(926), COMPRESS(1828), UNIX utilities;
EQN(6251), mathematics typesetting program; LEX(6873), lexical analyzer; YACC(8099), parser
generator; CCCP(8775), preprocessor for the GNU C compiler; and TBL(9191), table formatter.

5  Performance  Enhancement  Techniques 

In this section, we discuss techniques that can enhance code run time and reduce code growth at
the pseudo register phase. Loop L is protected from inside for register r if condition C1 or condition
C2 is true. However, L may have a special property which allows the save/restore nodes for r to be
moved out of the loop. This can save code run time since the save/restore nodes are then executed
only once rather than on every iteration of L. For register r, if every header-to-tail path within
loop L has at least one instruction defining r, then there exist suitable renamings along a certain
cut line on these paths after L has been expanded. We introduce the notions of the cut register set
and the cut node set as follows:

Definition (1) HR_i and HN_i are the sets of hazard registers and hazard nodes, respectively,
within loop L_i. (2) CR_i is the cut register set of loop L_i. Register r ∈ CR_i iff every directed path
leading from L_i's header to some tail has one or more instructions defining r. (3) CHR_i is the cut
hazard register set of loop L_i. r ∈ CHR_i iff r ∈ CR_i and r ∈ HR_i. (4) CN_{L_i}(r) is the cut node
set of loop L_i for register r. For any loop node a of L_i, a ∈ CN_{L_i}(r) iff r ∈ CR_i, a defines r,
and there exists at least one directed path from a to at least one of L_i's tails such that no node
on this path, except a, defines r. (5) Let d_L(I_i, I_j) denote the minimum number of edges on any
path within loop L from I_i to I_j, and D_L denote the minimum number of edges from L's loop header
to any of L's tails. (I_α, I_β) is a hazard pair within loop L on register r if I_α uses r, I_β
defines r, and d_L(I_α, I_β) ≤ N.
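The membership test for the cut register set in item (2) can be read as a reachability question: r is in CR_i exactly when no header-to-tail path avoids every definition of r. The following Python sketch illustrates that reading only. The graph encoding (`succ`, `defs`) and the function name are assumptions made for the example, not part of the original implementation, and backedges are assumed to be omitted from `succ`.

```python
from collections import deque


def in_cut_register_set(succ, header, tails, defs, r):
    """Return True if every header-to-tail path contains a definition of r.

    succ:  dict node -> list of successors inside the loop (backedges omitted)
    tails: set of tail nodes of the loop
    defs:  dict node -> set of registers defined by that node
    """
    if r in defs.get(header, set()):
        return True
    # Walk only through nodes that do not define r. Reaching a tail this
    # way means some header-to-tail path has no definition of r at all.
    seen = {header}
    queue = deque([header])
    while queue:
        node = queue.popleft()
        if node in tails:
            return False
        for nxt in succ.get(node, []):
            if nxt not in seen and r not in defs.get(nxt, set()):
                seen.add(nxt)
                queue.append(nxt)
    return True
```

For a diamond-shaped loop body in which only one of the two arms defines r, the function returns False, which matches the requirement that every path (not merely some path) carries a definition.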

5.1  Loop  Protection 

Consider that loop L_i needs to be protected from inside for r. If r ∈ HR_i and r ∉ CR_i, then the
save node r' ← r is in CN_{L_i}(r'), and all the restore nodes r ← r' are in CN_{L_i}(r). Note that r' is
a new register and all references of r within L_i are renamed to r'. If r ∈ CHR_i, then loop L_i can
be renamed correctly after being protected from outside for r and expanded a sufficient number of
times. Similarly, if r ∈ HR_i, r ∉ CR_i, and r is dead at the header of L_i, then L_i can be renamed
correctly after being protected from outside for r with the removal of the save node and expanded
a sufficient number of times.

Example. Figure 2(a) is a program segment for the first function of QSORT. A sign within
a circle denotes a hazard node, and U(x) represents using register x. Dotted lines denote that there
may be some nodes in between as long as they do not redefine the targeted registers. Solid lines
denote no instructions in between, and dashed lines denote cut lines. If we apply the original loop
protection algorithm, the loop is protected from inside for both registers, as shown in Figure 2(b).
Every loop iteration executes four additional instructions. We can move the save/restore nodes for
register a out of the loop since it belongs to the cut hazard register set of the loop. Figure 2(c)
illustrates such a transformation, and now every loop iteration executes two additional instructions.

5.2  Node  Splitting 

For any given node I, let X be the number of incoming edges, and M be the number of original
reaching definitions that can reach I, among which K definitions are hazards. We have implemented
a scheme in which the number of copies, S, for node I after splitting satisfies the following criterion:
1) S = K, if M = K; or 2) S = K + 1, if M > K.

This can be done by using a stamp heap data structure [13], so that if a hazard node I is split
into I_1, I_2, ..., I_S, then the stamp field of I_i, i = 2, 3, ..., S, points to I_1. The hazard nodes in the
same heap will be assigned to the same new destination register if renaming is required. Therefore a
node I with hazard x in its live_in set will be split if 1) there is more than one reaching definition
of x, and 2) not all reaching definitions of x belong to the same stamp heap, assuming that all
non-hazard nodes defining x belong to the same stamp heap.
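The copy-count criterion and the stamp heap split test above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: the function names and the `stamp_of` mapping (each defining node mapped to the representative of its stamp heap, with all non-hazard definitions sharing one representative) are hypothetical.

```python
def copies_after_splitting(m, k):
    """Number of copies S for a node with m reaching definitions, k of them hazards."""
    if not 0 <= k <= m:
        raise ValueError("need 0 <= k <= m")
    # S = K when every reaching definition is a hazard, otherwise S = K + 1.
    return k if m == k else k + 1


def needs_split(reaching_defs, stamp_of):
    """Split test: more than one reaching definition, not all in one stamp heap.

    stamp_of maps each defining node to the representative of its stamp heap
    (hypothetical encoding; non-hazard definitions share one representative).
    """
    heaps = {stamp_of[d] for d in reaching_defs}
    return len(reaching_defs) > 1 and len(heaps) > 1
```

Two reaching definitions that are copies of the same split node carry the same stamp, so `needs_split` treats them as one definition and does not request a further split.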
5.3  Loop  Processing  Order

Node splitting transforms all the hazards within the current loop across its backedges, while loop
expansion resolves all such hazards. In this manner, when we process a given loop, there is no data
hazard across the backedges of its inner loops. Therefore it is natural to process the loops from
inside out so that the levels of data hazards can be successively reduced until all of them occur at
the root level. The hazards at the root level can then be resolved by node splitting and renaming.

In addition to the inner loop first rule, we have to enforce the sequential order rule (top-down)
to smoothly check condition C3 in Section 4 for parent hazard registers and to further eliminate
extra save/restore nodes. Suppose that h_1 and h_2 are the loop headers of L_1 and L_2 respectively, and
there is a directed path from h_1 to h_2 without visiting any backedge. According to the loop nesting
property, we have two cases: 1) L_2 is an inner loop of L_1; or 2) L_2 is sequentially after L_1, as shown
in Figure 3(a) and 3(b). For case 1, L_2 is processed before L_1, while for case 2, L_1 is processed
before L_2. If, without visiting any backedge, there is no directed path from h_1 to h_2 or from h_2
to h_1, then either L_1 or L_2 can be processed first, as shown in Figure 3(c). Figure 3(d) illustrates
the inner loop first rule: new hazards due to loop protection will be propagated to the outer
loop. Figure 3(e) is a real program segment (the first function of CMP) and illustrates the
sequential order rule. Suppose L_2 is processed first. Without enforcing the sequential order rule,
L_2 may need to be protected from outside for register r, as shown in Figure 3(f). However, such
protection is redundant if we process L_1 first and remove its hazards that might affect L_2, as shown
in Figure 3(g).

Breadth First Search (BFS) is used to determine the processing order of nodes within loops or
nodes of the entire program. The starting nodes may be the headers of loops or the root of the
program. For some procedures, we have to modify the BFS algorithm by enforcing the following
rules: 1) a node can be processed if and only if all of its parents have been processed (MBFS); 2)
inner loops are bypassed (BBFS); 3) the direction of searching is reversed (RBFS).
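Rule 1 (MBFS) amounts to a breadth-first traversal in which a node is released only after all of its parents have been emitted. A minimal Python sketch follows. It assumes that backedges have already been removed from `succ`, so the graph reachable from `start` is acyclic, and the names `mbfs_order`, `succ`, and `start` are illustrative rather than taken from the original implementation.

```python
from collections import deque


def mbfs_order(succ, start):
    """BFS order in which a node appears only after all of its parents.

    succ: dict node -> list of successors, with backedges removed (assumed acyclic).
    """
    # Collect the nodes reachable from start.
    reachable = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in succ.get(node, []):
            if nxt not in reachable:
                reachable.add(nxt)
                stack.append(nxt)
    # Count, for each reachable node, the parents not yet processed.
    pending = {node: 0 for node in reachable}
    for node in reachable:
        for nxt in succ.get(node, []):
            pending[nxt] += 1
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in succ.get(node, []):
            pending[nxt] -= 1
            if pending[nxt] == 0:  # last parent just processed
                queue.append(nxt)
    return order
```

On the graph h → a, h → b, a → c, b → d, d → c, a plain BFS would visit c before d, whereas this traversal defers c until both of its parents a and d have been processed.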
5.4  Loop  Expansion


Our formula for the number of copies of L needed to resolve all on-path hazards within L is different
from the previous work [1] due to the cut register set. To simplify the analysis, we assume that loop
L has a header I_h and a single tail I_t. It can be easily extended to loops with multiple tails. Let
D_L = d_L(I_h, I_t). Assume that (I_i, I_j) is a hazard pair within loop L for register x. The new formula
includes the following cases: Case 1. The backedge (I_t, I_h) is not counted in d_L(I_i, I_j); Case 2. The
backedge (I_t, I_h) is counted in d_L(I_i, I_j), and within L there exists a directed path that does not
include (I_t, I_h) from I_j to I_i; and Case 3. The backedge (I_t, I_h) is counted in d_L(I_i, I_j), and within
L, not considering (I_t, I_h), there is no directed path from I_j to I_i.


Suppose it takes K_1, K_2, and K_3 copies to resolve the hazard pair (I_i, I_j) for each case
respectively. We have

\[
K_1 = \ldots + 2, \qquad
K_2 = K_3 =
\begin{cases}
2, & \text{if } d_L(I_i, I_t) + d_L(I_h, I_j) + 1 > N, \\
\ldots, & \text{otherwise.}
\end{cases}
\]

The  number  of  copies  of  L  needed  to  resolve  all  hazards  within  L  is  the  maximum  of  all  such 
K's.  Note  that  the  number  of  expansions  is  at  least  2. 


5.5  Self-Anti-Dependent  Instructions 

An instruction I is self-anti-dependent if I uses the register that it defines. For example, x ← x + a
is a self-anti-dependent instruction that defines and uses pseudo register x. This type of
anti-dependency can be resolved by splitting I into two instructions (I_1: y ← x + a, I_2: x ← y),
and then inserting N nops between them [1, 11]. However, using renaming with the aid of node
splitting and loop protection, we can rename the definition of x to a new pseudo register without
introducing a new instruction.


6  The  Incremental  Updating  Scheme 

6.1  For  On-Path  Hazards  -  Scheme  0 

Figure 4 shows the flowchart of the incremental scheme for on-path hazards during the pseudo
register phase. Three subroutines, loop-protection, node-splitting, and replicate-loop (marked in
Figure 4), may insert extra nodes around or within loops. Information associated with each node,
including register live range, stamp heap, and loop structure, is updated locally whenever a node
is inserted.

Figure 4: Incremental updating for on-path hazards

To determine if the header of an inner loop L_i will be split due to some parent loop hazard
register r, we have to check the nodes outside of L_i (condition C3). We confine the search for
such hazard pairs to across L_i, or for hazard nodes to within L_i's immediate parent loop, but not
within L_i. Assume that L_j is L_i's immediate parent loop, or the entire program (root level) if L_i
has no parent loop, and (I_α, I_β) is a hazard pair for register r. The two cases in which we consider
protecting L_i from outside for r are shown in Figure 5(a) and (b). In Figure 5(a), since the r in
I_β will be renamed, we only need to check if there is any other definition of r, I_γ, that can reach
I_h and is not in the same stamp heap as I_β. The search for I_γ is restricted to the shaded area,
denoting the definitions within L_j that can reach I_h without going through backedges, but can
be nodes in the upper levels that can reach I_h. The hazard in Figure 5(b) can also be resolved
by expanding L_i a sufficient number of times and renaming registers within L_i. For simplicity, we
protect L_i from outside for register r instead, so that the hazard is automatically resolved. For
other cases, the paths that cause hazards either belong to the current loop, which is detectable
when we process the current loop, or belong to outer loops, which will also be found when we
handle the outer loops subsequently.

(a) Hazard splitting the loop header. (b) Across loop on-path hazard. (c) Across loop branch hazard.
Figure 5: The confinement of search nodes outside the loop

6.1.1  Preparation

Subroutines renaming, live-analysis, record-loop-structure, and sort-loop are executed only once.
The incremental scheme does not perform global DU-chain and global reaching definition analysis
as Scheme L does, but rather performs a global live range analysis. After the preparation, loop
information and the dataflow information live_in and live_out are maintained and updated locally
throughout the computation. Loop processing order is determined by the sort-loop subroutine,
which evaluates loops in a top-down order in addition to the inner loop first rule.

6.1.2  Main  Loop 

The primary functions of loop expansion include: 1) compute the number of copies needed to
resolve all on-path hazards within loops; 2) replicate loops; and 3) rename all registers within loops.
When the compute-hazard subroutine bypasses inner loop hazards, functions 2 and 3 can be moved
out of the main loop without affecting the correctness of the new scheme. Therefore, the main
loop consists of the compute-hazard, loop-protection, get-number-of-replications, and node-splitting
subroutines, and each iteration evaluates one loop, as shown in Figure 4. This strategy is efficient
since the actual code growth (function 2) is outside the main loop.

Subroutine compute-hazard computes HR_i and HN_i, bypassing inner loop hazards. It traverses
nodes within L_i from the loop header in a BFS order. If node I defines x, it performs an RBFS
traversal from node I up to distance N, but the search never leaves L_i. x ∈ HR_i and I ∈ HN_i
iff there is a use of x within distance N. Subroutine loop-protection protects loop L_i according to
criteria C1, C2, and C3.
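The compute-hazard step described above can be sketched as a bounded reverse search from each defining node. The Python below is an illustration under assumptions, not the authors' code: `pred`, `defs`, and `uses` are hypothetical encodings of the loop's predecessor edges and of the registers each node defines and uses, `pred` is assumed to contain only edges inside the loop (so the search never leaves it), and the sketch ignores self-anti-dependent instructions and the bypassing of inner loops.

```python
from collections import deque


def use_within(pred, uses, start, x, n):
    """Reverse BFS from start: is there a use of x at most n edges back?"""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if dist[cur] == n:  # do not search beyond distance n
            continue
        for p in pred.get(cur, []):
            if p in dist:
                continue
            dist[p] = dist[cur] + 1
            if x in uses.get(p, set()):
                return True
            queue.append(p)
    return False


def compute_hazards(pred, defs, uses, n):
    """Return (hazard registers, hazard nodes) for one loop."""
    hazard_regs, hazard_nodes = set(), set()
    for node, defined in defs.items():
        for x in defined:
            if use_within(pred, uses, node, x, n):
                hazard_regs.add(x)
                hazard_nodes.add(node)
    return hazard_regs, hazard_nodes
```

For a straight-line sequence in which I1 uses x, I2 defines y, and I3 defines x, the definition in I3 is two edges after the use, so x is reported as a hazard register for N = 2 but not for N = 1.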

Subroutine get-number-of-replications performs a BFS traversal to compute D_{L_i} and an RBFS
traversal to compute d_{L_i}(I_α, I_β) for all nodes I_α, I_β in L_i. It then computes the number of
copies, according to the new formula, for every hazard pair (I_α, I_β) in L_i. The maximum of all such
values is the number of replications needed for L_i to resolve its hazards.

Subroutine node-splitting performs reaching definition analysis within loops in an MBFS order.
Nodes that have multiple reaching definitions, at least one of which is a hazard node, are split.
The nodes in inner loops are bypassed to save execution time. An inner loop header has multiple
incoming edges, but it will not be split due to the loop protection. When a node I is split into
several copies, each new copy has a pointer (stamp) linking it to I. Reaching definitions from nodes
that belong to the same stamp heap are considered the same reaching definition. This implements
the criterion mentioned in Section 5.2. Subroutine replicate-loop first marks the extended loop of
L_i for all hazard registers, and then applies a BFS traversal to replicate L_i. The number of copies
is obtained from the get-number-of-replications subroutine.

6.2  Incorporating  Branch  Hazards  -  Schemes  2  and  3

Branch hazards occur at branch boundaries when an error results in a wrong branch decision. The
following criterion can be used to locate all branch hazards: Register x is a branch hazard if there
exists a branch node I_br such that the distance from I_br to a definition of x along one branch
path of I_br is within N, and x is live at the other branch paths of I_br. Similar to the case shown
in Figure 5(b), we need to modify the loop protection criterion. As shown in Figure 5(c), I_α is a
branch node that does not use register r, and r is live along one branch path of I_α. Loop L_i is
protected from outside for register r, as if branch node I_α uses register r.
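The branch hazard criterion stated above can be illustrated with a small search over the CFG. The sketch below is hedged: `succ`, `defs`, and `live_in` are hypothetical encodings, N is assumed to be at least 1, the distance from a branch node to its immediate successor is counted as 1, and "live at the other branch paths" is approximated by the live_in sets of the other successors. It is not the authors' implementation.

```python
from collections import deque


def branch_hazards(succ, defs, live_in, n):
    """Registers defined within n edges of a branch node along one branch path
    and live on entry to another branch path of the same node."""
    hazards = set()
    for branch, targets in succ.items():
        if len(targets) < 2:
            continue  # not a branch node
        for t in targets:
            # Registers defined within distance n of the branch through t.
            defined = set()
            dist = {t: 1}
            queue = deque([t])
            while queue:
                cur = queue.popleft()
                defined |= defs.get(cur, set())
                if dist[cur] == n:
                    continue
                for s in succ.get(cur, []):
                    if s not in dist:
                        dist[s] = dist[cur] + 1
                        queue.append(s)
            # Registers live on entry to the other branch paths.
            live_elsewhere = set().union(
                *(live_in.get(o, set()) for o in targets if o != t)
            )
            hazards |= defined & live_elsewhere
    return hazards
```

If one arm of a branch redefines x immediately while the other arm still needs the old value of x, a wrong branch decision followed by rollback would find x overwritten, which is exactly the situation the function flags.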

By viewing x as if it is used at I_br, renaming can resolve branch hazards as well as on-path
hazards. We use the following example to illustrate the idea.

Example. Consider the partial segment of EQN, as shown in Figure 6(a), and N = 4. Register
x at node I is a branch hazard due to branch nodes I_br and I_t. After loop protection as in
Figure 5(c), and renaming x to y, the new register y at node I is a branch hazard due to branch
node I_t, as shown in Figure 6(b). Note that the save instruction y ← x before the loop header I_h
is removed since x is not live at I_h. In Figure 6(c), by expanding the loop twice and renaming, the
branch hazard is resolved. The formula for the number of loop replications can also be modified by
viewing the branch node as using the hazard register x.

(a) EQN segment, N = 4. (b) After loop protection on x. (c) After expanding twice and renaming.

Figure 6: The loop expansion for branch hazards


7  Post-Pass  Code  Rescheduling  and  Spill  Register  Reassignment

7.1  On-Path  Hazards  -  Scheme  0 

Although the pseudo register phase aims at removing on-path hazards within a function, new
hazards may emerge after the machine register phase. First, the stack pointer adjustment instructions
within the prologue segment and the epilogue segment create immediate anti-dependencies of length
1. Second, before calling a procedure, the registers used as parameters need to be saved before the
new values can be loaded. Register spilling may also create on-path hazards. When a register is to be
spilled, most likely it will be loaded with new values, thus creating a use-before-definition scenario.
A straightforward post-pass nop insertion algorithm was employed in Scheme L to resolve these
new hazards. Sufficient nops are inserted before the hazard definitions to force all anti-dependency
distances to exceed N.
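The straightforward nop insertion idea can be sketched for a single straight-line block as follows. This is an illustrative simplification, not the Scheme L implementation: instructions are modeled as hypothetical `(text, defs, uses)` tuples, control flow is ignored, and a register that is both used and defined by the same instruction is not treated specially.

```python
NOP = ("nop", (), ())


def insert_nops(code, n):
    """Pad straight-line code so every use-to-later-definition distance exceeds n.

    code: list of (text, defs, uses) tuples (hypothetical instruction encoding).
    """
    out = []
    last_use = {}  # register -> index in `out` of its most recent use
    for text, defs, uses in code:
        need = 0
        for x in defs:
            if x in last_use:
                distance = len(out) - last_use[x]
                need = max(need, n + 1 - distance)
        out.extend([NOP] * need)  # a non-positive count adds nothing
        position = len(out)
        out.append((text, defs, uses))
        for x in uses:
            last_use[x] = position
    return out
```

With N = 2, a use of x immediately followed by a definition of x is separated by two nops, giving a distance of 3. The rescheduling and reassignment techniques described next aim to reduce how often such padding is needed.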

In this section, we apply a code rescheduling technique within the prologue and the epilogue
segments, and a register reassignment algorithm for rearranging spill registers, so that the total
number of nops inserted is greatly reduced. The post-pass algorithm includes the following steps:
1) reassign spill registers; 2) reschedule code and insert nops in the prologue segment; 3) reschedule
code and insert nops in the epilogue segment; and 4) insert remaining nops.

The IMPACT C compiler reserves three registers as spill registers, namely $3, $24, and $25. The
spill registers serve two memory-access functions, load and store. The compiler generates
instructions of the following groups for load and store functions respectively, where $r1 and $r2 are
different spill registers, and are dead after the second (or the third) instruction:

load $r1, memory;        load $r1, memory1;        operation defining $r1;
use $r1;                 load $r2, memory2;        store $r1, memory;
                         use $r1, $r2;

Spill registers serve as temporaries and have very short live ranges, i.e., 2 or 3 instructions. On-path
hazards occur when two groups of spill code use the same spill register and their distance, from the
use of the first group to the definition of the second group, is less than or equal to N. All groups
of spill code can be easily identified. The goal is to minimize the number of nops needed to resolve
all hazards. Our approach is to utilize dead registers as substitutes within groups so that the sum
of all the anti-dependency distances for spill registers and substitutes is maximized, taking
the anti-dependency distance between groups of different spill registers and substitutes to be N + 1.
In general, this problem is NP-hard: it includes as a special case the following NP-complete
problem, obtained by fixing that only spill registers are dead registers and N = 1:

Given K colors, an undirected graph G and an integer n, is there a node coloring such
that the number of edges with the same colors at both ends is at most n?

This can be proven by restricting n to 0, whereupon it becomes the K-colorability problem [16].
We therefore propose a simple heuristic algorithm that reassigns spill registers within groups during a BFS
traversal of the entire program. We always choose as a substitute the register which is dead before
and after the group, and whose sum of the distance backward to the first use and the distance
forward to the first definition is maximum.
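The substitute selection rule can be illustrated on a linear instruction sequence. The sketch below is an assumption-laden simplification: it takes the set of dead registers as given rather than computing liveness, it caps each distance at N + 1 (a choice made here, since larger distances need no nops), and every name in it (`choose_substitute`, `backward_use_distance`, `forward_def_distance`) is hypothetical.

```python
from typing import List, Optional, Set, Tuple

Instr = Tuple[Set[str], Set[str]]  # (registers defined, registers used)


def backward_use_distance(code: List[Instr], start: int, reg: str, limit: int) -> int:
    """Distance from the group's first instruction back to the nearest use of reg."""
    for d in range(1, limit + 1):
        i = start - d
        if i < 0:
            break
        if reg in code[i][1]:
            return d
    return limit


def forward_def_distance(code: List[Instr], end: int, reg: str, limit: int) -> int:
    """Distance from the group's last instruction forward to the nearest definition of reg."""
    for d in range(1, limit + 1):
        i = end + d
        if i >= len(code):
            break
        if reg in code[i][0]:
            return d
    return limit


def choose_substitute(code: List[Instr], start: int, end: int,
                      dead_regs: Set[str], n: int) -> Optional[str]:
    """Pick the dead register with the largest backward-plus-forward distance."""
    limit = n + 1
    best, best_score = None, -1
    for reg in sorted(dead_regs):  # sorted only to make ties deterministic
        score = (backward_use_distance(code, start, reg, limit)
                 + forward_def_distance(code, end, reg, limit))
        if score > best_score:
            best, best_score = reg, score
    return best


code = [
    (set(), {"$8"}),    # 0: use $8
    (set(), {"$9"}),    # 1: use $9
    ({"$24"}, set()),   # 2: spill group starts (load into $24)
    (set(), {"$24"}),   # 3: spill group ends (use $24)
    ({"$9"}, set()),    # 4: define $9
    ({"$8"}, set()),    # 5: define $8
]
print(choose_substitute(code, start=2, end=3, dead_regs={"$8", "$9"}, n=3))
```

Here $8 scores 2 + 2 = 4 while $9 scores 1 + 1 = 2, so $8 is the preferred substitute.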

The prologue segment includes code to adjust the stack pointer and to save the values of
some local registers to memory, while the epilogue segment includes code to retrieve the original
values of the same local registers from memory and to adjust the stack pointer. We illustrate the
improvement to the epilogue segment by an example; the prologue segment can be handled
similarly. Figure 7(a) shows the epilogue segment of the second function, merge-sort, of QSORT,
for N = 10. Figure 7(b) illustrates how register assignment and code rescheduling are used
to eliminate 16 nops in the epilogue segment. Instruction 'addu $30, $sp, 128' has been moved
up ahead of all the instructions that load local registers, with the base register of those loads replaced
by $30. The instructions to load local registers are rescheduled according to their distances from
the first uses of the corresponding registers. Since registers $16, $17, $20, and $23 all have distance
1, they are moved to the end of the load instructions. Four more nops are needed to resolve the
hazard register $23.

Figure 7: Post-pass code rescheduling for the epilogue segment of QSORT, N = 10
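The effect of ordering the epilogue loads by distance can be sketched as below. The model is an assumption made for illustration: `dist[reg]` is taken to be the anti-dependency distance a load of `reg` would have if it were the first load in the sequence, every load or nop emitted ahead of it adds one to that distance, and the register distances in the example are invented rather than taken from Figure 7.

```python
from typing import Dict, List, Tuple


def schedule_loads(dist: Dict[str, int], order: List[str], n: int) -> Tuple[List[str], int]:
    """Emit loads in the given order, padding with nops; return (schedule, nop count)."""
    schedule: List[str] = []
    nops = 0
    for reg in order:
        gap = dist[reg] + len(schedule)  # distance from the last use of reg to this load
        if gap <= n:                     # still a hazard: pad until the distance exceeds n
            pad = n + 1 - gap
            schedule.extend(["nop"] * pad)
            nops += pad
        schedule.append(f"load {reg}")
    return schedule, nops


def reschedule(dist: Dict[str, int], n: int) -> Tuple[List[str], int]:
    """Load the registers that are farthest from their last use first."""
    order = sorted(dist, key=lambda reg: -dist[reg])
    return schedule_loads(dist, order, n)


dist = {"$16": 1, "$17": 1, "$20": 4, "$21": 6}  # hypothetical distances
print(schedule_loads(dist, list(dist), 3))  # original order
print(reschedule(dist, 3))                  # distance-sorted order
```

In this invented example the original order needs three nops, while the sorted order needs one, because the earlier loads themselves fill the slots that would otherwise hold nops.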

The code rearrangements within the prologue and epilogue segments will not create on-path
hazards across procedural boundaries, since we can consider a subroutine call as a single
instruction using the register that holds the return address, e.g., register $31 in IMPACT C. The
last step simply performs a BFS traversal and inserts the required nops to resolve all remaining
on-path hazards.

7.2 Both Types of Hazards - Schemes 1, 2, and 3

Post-pass nop insertion can also resolve extra branch hazards generated by the machine register
allocator. The branch hazard check can be incorporated in the original on-path hazard check. The
heuristic to reassign spill registers has to be modified as follows. The register we choose to replace
the reserved spill register at a specific group G of spill instructions must not only be dead before
and after G, but also require as few nops as possible to resolve the new branch hazard induced
by the substituting register. This can be achieved by applying an RBFS traversal from the first
instruction of G, up to distance N. For every branch node $I_{BR}$ visited, and for those registers which
are live at the other branch of $I_{BR}$, set "the distance backward to the first use" in the heuristic to
the distance from $I_{BR}$ to G, as if those registers were used at $I_{BR}$. In the last step, we insert nops
to resolve the remaining on-path hazards and branch hazards.
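A rough sketch of the backward traversal is given below. It assumes the control flow graph is available as predecessor and successor maps keyed by node id, that `live_in` gives the registers live on entry to each node, and that each node is expanded only once (so only the first path reaching a branch node is considered). These data structures and the function name are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def branch_adjusted_distances(g_first: int,
                              preds: Dict[int, List[int]],
                              succs: Dict[int, List[int]],
                              live_in: Dict[int, Set[str]],
                              n: int) -> Dict[str, int]:
    """Walk backward from the first instruction of G, up to distance n.

    For each branch node reached, registers live on its other branch are
    recorded with the distance from that branch node to G.
    """
    dist: Dict[str, int] = {}
    seen = {g_first}
    queue = deque([(g_first, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == n:
            continue
        for p in preds.get(node, []):
            if p in seen:
                continue
            seen.add(p)
            d = depth + 1
            if len(succs.get(p, [])) > 1:  # p is a branch node
                for other in succs[p]:
                    if other != node:      # the branch not leading to G on this path
                        for reg in live_in.get(other, set()):
                            dist[reg] = min(dist.get(reg, d), d)
            queue.append((p, d))
    return dist


# Node 0 branches to 1 (which leads to G at node 2) and to 5, where $18 is live.
preds = {2: [1], 1: [0]}
succs = {0: [1, 5], 1: [2]}
live_in = {5: {"$18"}}
print(branch_adjusted_distances(2, preds, succs, live_in, 3))
```

A caller would then use the recorded value for a candidate register wherever it is smaller than that register's on-path backward distance; that combination step is an assumption of this sketch and is not shown.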

The above schemes for incorporating branch hazard resolution do not create extra hazards
across procedural boundaries. However, depending on the implementation, the callee-saved registers
may incur a performance impact due to separate compilation.

As shown in Figure 8(a), suppose a wrong decision is made at branch node I. After rollback
and a correct decision at I, register $r has a wrong value. If $r is in Y's callee-saved register set,
then $r is live along I's target (T) branch. Several nops should be inserted between I and J to
resolve such a branch hazard. However, since Y's callee-saved register set is unknown in the current
procedure X, a conservative scheme may assume that all the potential registers are in the set, e.g.,
$16, $17, ..., $23 in IMPACT C. By viewing K as a node that uses this set, we can incorporate it
in the initial global live range analysis.

To relieve the situation, certain remedies can be implemented. For library routines, a built-in
table holding the corresponding saved register sets can be attached to the compiler. The following
check can terminate $r's live range before the procedure call, regardless of whether $r belongs
to the callee-saved register set: $r ∈ live_in(M) iff $r is live at node K, where M is the next
instruction following the subroutine call node K. Such live range checking starting from M should
skip any subroutine call encountered.

Figure 8: Register live range across procedure boundaries
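The conservative treatment of a call node and the two remedies can be combined in a few lines. This is a sketch under assumptions: the callee-saved candidates are the registers $16 to $23 quoted above for IMPACT C, `live_in_next` stands for live_in(M) and is assumed to have been computed while skipping later calls, and the function name and table format are hypothetical.

```python
from typing import Dict, FrozenSet, Optional, Set

# Candidate callee-saved registers $16 .. $23, as quoted in the text for IMPACT C.
CALLEE_SAVED: FrozenSet[str] = frozenset(f"${i}" for i in range(16, 24))


def registers_used_by_call(callee: str,
                           live_in_next: Set[str],
                           library_table: Optional[Dict[str, FrozenSet[str]]] = None) -> Set[str]:
    """Registers that the call node K is treated as using in live range analysis."""
    table = library_table or {}
    # Unknown callee: conservatively assume it saves every candidate register.
    saved = table.get(callee, CALLEE_SAVED)
    # A register that is dead at M (the instruction after the call) is not kept live by the call.
    return set(saved) & set(live_in_next)


# $18 is dead after the call, so the call does not extend its live range.
print(registers_used_by_call("try", {"$4", "$16"}))
```

In the QUEEN example discussed next, this is the situation for $18: it is absent from live_in(M), so the call node is not treated as using it.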

Example. Figure 8(b) is an assembly code segment for the recursive function try of QUEEN.
Without checking the additional condition, N nops are inserted between node I and node J to
eliminate the hazard on $18. None is required once we observe that $18 is dead after node K. Code
run-time performance is improved, since these N nops would otherwise lie within a loop.


8  Performance  Evaluation 

8.1 Resolving On-Path Hazards - Scheme 0 vs. Scheme L

The incremental updating scheme and the post-pass code rescheduler improve application compile
time and run-time performance, and reduce code growth, for most applications studied. In this section
we compare the performance impact of Scheme 0 and Scheme L with respect to compile time,
code run time, and code size. For comparison purposes, we investigate the same set of benchmarks
used in [1]: CMP, COMPRESS, PUZZLE, QSORT, QUEEN, and WC.

Scheme 0 finishes compilation for all benchmarks within a short time. For N = 10, Scheme
L spends more than 8 minutes, 15 seconds, 1.5 minutes, 3.5 minutes, and 9.5 minutes to
compile benchmarks QSORT, QUEEN, CMP, WC, and PUZZLE respectively, while Scheme 0 takes
less than 16 seconds, 8 seconds, 15 seconds, 15 seconds, and 50 seconds respectively.
COMPRESS has the best compile time improvement: Scheme L spends more than an hour to compile for
N = 7, 8, and 9, and almost two hours for N = 10, while Scheme 0 compiles
within 3 minutes in every case.

Table 2: Code run time overhead
(entries are listed for increasing N from 1 to 10; "?" marks an illegible entry, and rows with
fewer than ten entries are incomplete)

Benchmark  Scheme  Overhead
QSORT      L       6.2%  8.3%  8.3%  10.4%  11.5%  13.5%  14.6%  26.0%  22.9%  30.2%
QSORT      0       ?  ?  ?  10.4%  10.4%  13.5%  15.6%  16.7%
QUEEN      L       ?  ?  ?  ?  11.5%  15.8%  ?  20.9%
QUEEN      0       ?  ?  ?  10.2%  16.3%
CMP        L       -1.8%  -1.8%  ?  -1.8%  -1.8%  -1.8%  -1.8%  -1.8%  -1.8%
CMP        0       ?  -2.4%  -2.4%  -2.4%  -2.4%  -2.4%
WC         L       3.8%  3.8%  3.8%  ?  3.8%  ?  3.8%  3.8%  3.8%  4.4%
WC         0       ?  ?  1.3%
PUZZLE     L       ?  ?  ?  -0.7%  -0.7%  -0.7%  ?  -0.7%
PUZZLE     0       ?  ?  ?  ?
COMPRESS   L       ?  ?  1.2%  5.6%  6.2%  11.2%  18.8%
COMPRESS   0       ?  ?  12%  ?  5.6%  10.6%  16.9%
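The compile times quoted for N = 10 give only bounds ("more than" for Scheme L, "less than" for Scheme 0), so dividing one by the other yields a lower bound on the compile-time ratio. The short calculation below uses only those quoted figures, converted to seconds and read as applying to every benchmark in the list; the variable and function names are hypothetical.

```python
# Bounds quoted in the text for N = 10, converted to seconds.
scheme_l_more_than = {"QSORT": 8 * 60, "QUEEN": 15, "CMP": 90, "WC": 210, "PUZZLE": 570}
scheme_0_less_than = {"QSORT": 16, "QUEEN": 8, "CMP": 15, "WC": 15, "PUZZLE": 50}


def speedup_lower_bounds(slow: dict, fast: dict) -> dict:
    """If slow time > a and fast time < b, then slow/fast > a/b."""
    return {name: slow[name] / fast[name] for name in slow}


bounds = speedup_lower_bounds(scheme_l_more_than, scheme_0_less_than)
for name, ratio in bounds.items():
    print(f"{name}: more than {ratio:.1f}x")
```

On these figures QSORT shows the largest bound (30x) and QUEEN the smallest (just under 2x).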

Table 2 lists the code run time overhead using both schemes. Both schemes pass through the
pseudo register and machine register anti-dependency resolvers and the nop inserters, generating
code free from anti-dependencies. Rows marked "L" and "0" give the code run time overhead of
Scheme L and Scheme 0 respectively.

Let $T^{L}_i$ and $T^{0}_i$ be the Scheme L code run time and the Scheme 0 code run time respectively for
anti-dependency distance i. The run-time enhancement factor is defined in terms of $T^{L}_i$ and $T^{0}_i$
for i = 1, 2, ..., 10, and is plotted in Figure 9. Two benchmarks, QSORT and QUEEN, include recursive
functions and have among the largest run-time enhancement factors for N > 5. Post-pass code
rescheduling contributes most to these benchmarks.

Figure 9: Run-time enhancement - Scheme 0 vs. Scheme L

Table 3 lists the code size overhead using both schemes. Let $S^{L}_i$ and $S^{0}_i$ be the
Scheme L code size and the Scheme 0 code size respectively for anti-dependency distance i. From the
last column, the overheads are within 250% for N = 10.

The size enhancement factor is defined in terms of $S^{L}_i$ and $S^{0}_i$ for i = 1, 2, ..., 10, and is
plotted in Figure 10. COMPRESS has negative size enhancement factors for the following reasons: 1)
the enhanced scheme removes the 800 instruction threshold [1], which allows further code growth;
and 2) one function enters simplified mode for N > 0, and there are two when N > 5. For N = 1
and 2, QSORT and WC have negative size enhancement factors. This is because proper renaming
after protecting loop L from inside and node splitting for small N may prevent loop L from being
expanded, while using the cut register set technique to move save/restore nodes out of the loop L
requires L to be expanded at least once.

8.2  Resolving  On-Path  and  Branch  Hazards  -  Schemes  1,  2,  and  3 

Schemes 1, 2, and 3 deal with removing both types of hazards during three separate phases. Scheme
1 has the fastest compilation speed since it postpones the branch hazard resolution to the last phase,
i.e., nop insertion.

All three schemes perform about the same for the twelve benchmarks studied. The reasons
may be that 1) branch hazards occur less frequently; 2) both the machine register and nop
insertion phases employ heuristics, and the spill register reassignment heuristic may be efficient
enough to resolve branch hazards in the post-pass; and 3) resolving branch hazards at the pseudo
register phase or the machine register phase is likely to cause larger code growth, due to the extra
node splitting and loop expansion. In most benchmarks, Scheme 1 even outperforms the other
two schemes in both code run time and code growth, e.g., QUEEN, QSORT, CMP, COMPRESS,
PUZZLE, and WC.

Table 3: Code size overhead
(entries are listed for increasing N from 1 to 10; "?" marks an illegible entry, and rows with
fewer than ten entries are incomplete)

Benchmark  Scheme  Overhead
QSORT      L       62.5%  69.7%  104.6%  114.6%  123.4%  136.0%  154.4%  199.2%  218.8%  273.9%
QSORT      0       101.1%  ?  109.6%  118.0%  130.3%  138.3%  146.4%  168.6%  190.8%
QUEEN      L       56.8%  68.9%  124.3%  133.8%  152.0%  164.2%  176.4%  208.1%  218.9%  309.5%
QUEEN      0       53.4%  58.1%  68.2%  78.4%  127.0%  132.4%  147.3%  151.4%  179.1%
CMP        L       ?  ?  106.8%  140.6%  158.2%  179.3%  199.6%  227.5%
CMP        0       60.2%  63.3%  76.1%  81.7%  83.7%  87.6%  90.4%  121.5%
WC         L       132.6%  159.7%  166.9%  215.5%  244.8%  248.7%  256.9%  289.5%
WC         0       162.6%  164.1%  165.2%  187.3%  205.0%  208.8%  244.2%
PUZZLE     L       80.3%  89.4%  90.8%  93.7%  101.1%  105.9%  126.0%
PUZZLE     0       ?  78.9%  80.5%  84.0%  84.5%  86.7%  95.6%  99.1%  100.5%  111.2%
COMPRESS   L       ?  31.6%  52.4%  60.1%  69.0%  80.0%  93.9%  106.5%  129.0%
COMPRESS   0       ?  ?  82.0%  86.6%  107.8%  122.4%  151.8%  156.1%

Table 4: Run time overhead for Scheme 1
(entries are listed for increasing N from 1 to 10; "?" marks an illegible entry, and rows with
fewer than ten entries are incomplete)

Benchmark  Overhead
QSORT      6.2%  6.2%  7.3%  9.4%  9.4%  12.5%  12.5%  16.7%  18.7%  18.7%
QUEEN      2.8%  3.1%  4.1%  5.7%  6.3%  6.7%  7.4%  11.1%  11.2%  18.0%
CMP        -3.0%  -3.0%  -3.0%  -3.0%  -3.0%  -3.0%  -2.4%  -1.8%  -1.2%  -1.2%
WC         1.3%  1.3%  1.3%  1.3%  1.3%  1.3%  1.3%  1.3%
PUZZLE     0.0%  0.0%  0.0%  ?  0.0%  0.0%  0.0%  0.7%
COMPRESS   1.3%  2.0%  4.0%  7.3%  9.3%  9.9%  11.3%  13.9%  17.9%
GREP       11.1%  11.1%  ?  11.1%  11.1%  13.0%  13.0%  13.0%  14.8%  24.1%
LEX        10.5%  11.6%  11.6%  11.6%  14.0%  14.0%  18.6%
EQN        7.8%  ?  12.2%  ?  13.9%  13.9%  13.9%  13.9%
YACC       0.0%  0.0%  2.4%  2.4%  *7.1%  *23.8%  *28.6%
CCCP       8.5%  9.3%  10.1%  11.6%  11.6%  *17.1%  *17.1%  *19.4%  *20.9%  *26.4%
TBL        5.3%  7.9%  7.9%  7.9%  7.9%  14.5%  14.5%  14.5%  15.8%

Figure 11: Percentage of hazard nodes that are branch hazard nodes

The performance overhead of Scheme 1 is tabulated in Table 4. Due to the heuristic algorithm
employed in the post-pass phase, the performance overhead we observed does not increase
monotonically with N. However, code generated to allow N-instruction rollback certainly also works
for an (N - 1)-instruction rollback scheme. Therefore, we can record the overhead non-decreasingly. All
twelve benchmarks successfully pass the pseudo register phase in a short time. However,
several functions generate more than 15,000 nodes, which increases the computation time for the
machine register assignment phase when N > 5. YACC has two such functions, and CCCP has
one. For these three functions, we resolve the rollback hazards of distance 5 in the pseudo register
phase, and then resolve the rollback hazards of distance N > 5 in the post-pass phase, as marked
by "*" in Table 4.
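One plausible reading of "recording the overhead non-decreasingly" is a suffix minimum: because code compiled for a larger rollback distance also serves a smaller one, the overhead reported for a given N never needs to exceed the overhead measured for any larger N. The sketch below implements that reading; the function name and the sample numbers are hypothetical and are not measurements from this work.

```python
from typing import List


def record_non_decreasing(raw: List[float]) -> List[float]:
    """raw[k] is the measured overhead of code compiled for rollback distance k + 1.

    Each entry is replaced by the smallest overhead measured at that distance
    or any larger one, which makes the sequence non-decreasing.
    """
    out = list(raw)
    for k in range(len(out) - 2, -1, -1):
        out[k] = min(out[k], out[k + 1])
    return out


# Hypothetical measurements for N = 1..4; the N = 2 code is worse than the N = 3 code.
print(record_non_decreasing([6.0, 9.0, 8.0, 12.0]))
```

In the hypothetical input, the code built for N = 3 is simply reused for N = 2, so the second entry drops from 9.0 to 8.0.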

Figure 11 depicts the percentage of hazard nodes that are branch hazard nodes but not
on-path hazard nodes, for various rollback distances N. Benchmarks QUEEN and QSORT have a
percentage of 0 for N up to 10 because either they have no branch hazards, or all of their branch
hazards are also on-path hazards. PUZZLE has the highest percentage of branch hazard nodes,
42.42% when N = 3. There is a sharp rise from N = 2 to N = 3 due to the relative distances
between branch nodes and hazard nodes. This can explain why, in Scheme A, PUZZLE has the
highest run-time overhead, 10%, when N = 10 [11]. The post-pass algorithms apparently trim
the overhead down to 0.7%, as shown in Table 4. All the other benchmarks have fewer than a quarter of
their hazard nodes that are branch hazard nodes but not on-path hazard nodes.
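The quantity plotted in Figure 11 can be expressed as a small set computation. The sketch assumes hazard nodes are identified by ids and that the on-path and branch hazard sets are already known; the function name and the example sets are hypothetical.

```python
from typing import Set


def branch_only_percentage(on_path: Set[int], branch: Set[int]) -> float:
    """Percentage of hazard nodes that are branch hazards but not on-path hazards."""
    hazards = on_path | branch
    if not hazards:
        return 0.0
    return 100.0 * len(branch - on_path) / len(hazards)


# Hypothetical node ids: node 3 is both kinds of hazard, node 4 is a branch hazard only.
print(branch_only_percentage({1, 2, 3}, {3, 4}))
# Every branch hazard is also an on-path hazard, as described for QUEEN and QSORT.
print(branch_only_percentage({1, 2, 3}, {2, 3}))
```

The second call illustrates how a benchmark can show a percentage of 0 even though it does contain branch hazards.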

9  Conclusion 

An incremental updating scheme has been incorporated in the compiler-assisted multiple instruction
retry scheme, resulting in significantly reduced compile times. To improve the code run time and
to reduce the code size, several approaches have been applied. By identifying the cut register set,
save/restore nodes can be moved out of the loops during loop protection. The code in the prologue
and epilogue segments can be rescheduled, and the spill registers can be reassigned to reduce
the total number of nops inserted. The threshold for the number of nodes increases from 800 to
15,000. Branch hazards can also be resolved by simple modifications to the proposed approaches.
Based on the types of hazards resolved at the three different phases, we have implemented three
schemes to transform the programs into code with rollback capability. Among them, Scheme 1
postpones the resolution of branch hazards to the last phase, and hence has the fastest compilation
speed. It also typically generates code as good as that of the other two schemes in both code run time and
code growth.